
doc: add cluster manager reference architecture #1209

Draft

minaelee wants to merge 1 commit into canonical:main from minaelee:cluster-manager-architecture

Conversation

@minaelee (Contributor) commented Feb 2, 2026:

Add reference architecture documentation for MicroCloud Cluster Manager.

@github-actions bot added the Documentation label on Feb 2, 2026
@minaelee force-pushed the cluster-manager-architecture branch from 84f08c8 to 2496821 on February 4, 2026 at 15:43
@edlerd (Contributor) left a comment:

Excellent start. I have many thoughts and comments below. We can have a chat if you like, to clarify the open issues.


The MicroCloud Cluster Manager is a centralized tool that provides an overview of MicroCloud deployments. In its initial implementation, it provides an overview of resource usage and availability for all clusters. Future implementations will include centralized cluster management capabilities.

Cluster Manager stores the data from registered clusters in Postgres and Prometheus databases. This data can be displayed in the Cluster Manager UI, which also links to Grafana dashboards for each MicroCloud.
@edlerd (Contributor) commented:

> which also links to Grafana dashboards for each MicroCloud

This is a possible extension. By default, the COS stack is not available. So a user will deploy cluster manager and get the manager UI without links to Grafana.

@minaelee (Contributor Author) replied:

Does this update work, or would you prefer we did not mention Grafana at all?

> This data can be displayed in the Cluster Manager UI, which can be extended to link to Grafana dashboards for each MicroCloud.

Note: This information is from https://github.com/canonical/microcloud-cluster-manager/blob/main/ARCHITECTURE.md and likely should be updated there as well.

@edlerd (Contributor) replied:

Yes, suggestion sounds good to me. I'll take a note to update the architecture file.

Comment on lines +19 to +22
```{figure} ../images/cluster_manager_architecture.png
:alt: A diagram of Cluster Manager architecture
:align: center
```
@edlerd (Contributor) commented:

This diagram is from an earlier development environment. It is mostly correct, but some things have slightly changed.

@minaelee (Contributor Author) replied:

Is there an updated diagram, or can you let me know what has changed and I can update it?

@edlerd (Contributor) replied Feb 5, 2026:

We don't have an updated diagram yet. Things that have changed:

  • A single TCP load balancer instead of two, exposing two different domain names: one for the management-api and one for the cluster-connector.
  • The Postgres service, PG deployment, and volume claims are "just" one thing: the Postgres charm. The rest is internal detail of the PG charm; the diagram exposes too much detail of the PG charm internals, with assumptions that might be wrong.
  • Cert manager is to be replaced by a charm implementing the "certificates" charm interface. We might just change the label here.
  • K8s secrets/k8s config are to be replaced by a Juju config layer. Under the hood this is still true; I am not sure how to unify the levels of detail in the diagram to surface k8s internals versus charm/Juju internals.
  • management-api and cluster-connector live together in the same container. Each container runs those two processes, and there can be multiple containers to scale out (see the sketch after this list).
  • We might want to add the Canonical Observability Stack as an optional extension to the diagram, with Prometheus and Grafana.
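
A minimal sketch of the "two processes per container" point above, assuming a small Python entrypoint and assuming the binary names `management-api` and `cluster-connector` (both are assumptions for illustration; the real image layout and flags may differ):

```python
# Hypothetical container entrypoint: supervise both processes and exit
# if either stops, so the orchestrator restarts the container as a unit.
# Scale-out happens by running more containers, each with both processes.
import subprocess
import sys
import time

COMMANDS = [
    ["management-api"],      # hypothetical binary name
    ["cluster-connector"],   # hypothetical binary name
]

def main() -> int:
    procs = [subprocess.Popen(cmd) for cmd in COMMANDS]
    try:
        while True:
            for proc in procs:
                code = proc.poll()
                if code is not None:
                    # One process exited; return its code so the
                    # container stops and can be restarted.
                    return code
            time.sleep(1)
    finally:
        # Stop whichever process is still running before exiting.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()

if __name__ == "__main__":
    sys.exit(main())
```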

@edlerd (Contributor) replied:

I can create a task for myself to create an updated diagram.

That static external IP acts as the gateway to route user traffic to the appropriate Kubernetes load balancers.

TCP load balancers
: Two TCP load balancer services distribute traffic to the Management API and Cluster Connector deployments without terminating TLS. Instead, TLS termination is handled directly within each deployment application. This approach is particularly crucial for the Cluster Connector deployment, as it relies on mutual TLS (mTLS) authentication for secure communication.
@edlerd (Contributor) commented:

We are using a single Traefik instance to handle the incoming requests; there are no longer two load balancers.

@minaelee (Contributor Author) replied:

Fixed to:

> A TCP load balancer (using a Traefik instance) distributes traffic to the Management API and Cluster Connector deployments without terminating TLS.
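
As context for TLS not being terminated at the load balancer, here is a minimal sketch, using only the Python standard library, of application-side TLS termination that also requires a client certificate (the mTLS case the doc text describes for the Cluster Connector). The certificate paths and port are placeholders, not the actual implementation:

```python
# Illustrative sketch only: an application-terminated TLS listener that
# requires client certificates (mutual TLS). Paths and port are placeholders.
import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="server.crt", keyfile="server.key")
# Requiring and verifying a client certificate is what makes this mutual TLS.
context.verify_mode = ssl.CERT_REQUIRED
context.load_verify_locations(cafile="trusted-clients.pem")

with socket.create_server(("0.0.0.0", 8443)) as server:
    with context.wrap_socket(server, server_side=True) as tls_server:
        conn, addr = tls_server.accept()   # TLS handshake happens here
        peer_cert = conn.getpeercert()     # identity of the connecting cluster
        print("mTLS connection from", addr, "subject:", peer_cert.get("subject"))
        conn.close()
```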

Comment on lines +33 to +34
Certificate manager
: Manages TLS/SSL certificates for secure communication within the Kubernetes cluster. It stores secrets in Kubernetes to be used by various components. The certificates are used by both the Management API and Cluster Connector deployments for HTTPS encryption.
@edlerd (Contributor) commented:

We now rely on a charm that implements the certificates interface to provide certificates. This can be the self-signed-certificates charm, as suggested in the readme. We do not rely on the certificate manager k8s app anymore.

@minaelee (Contributor Author) replied:

Should the Certificate manager section in lines 33-34 above be removed entirely?

@edlerd (Contributor) replied:

I think we can remove it, yes.

Comment on lines +39 to +40
Persistent Volume (PV) and Persistent Volume Claim (PVC)
: The Persistent Volume is the storage resource provisioned for the Postgres deployment. The Persistent Volume Claim is the request for storage by the Postgres deployment to ensure data persistence.
@edlerd (Contributor) commented:

We rely on the Canonical Postgres charm. How that charm does persistent storage is outside our control.

@minaelee (Contributor Author) replied:

What information should we provide in this section instead, or should we remove it entirely?

@edlerd (Contributor) replied:

I think we can remove it, yes.

(ref-cluster-manager-architecture-management-ui)=
### UI

The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
@edlerd (Contributor) commented:

Suggested change:
- The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
+ The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens.

@edlerd (Contributor) added:

We can expand here: we serve warnings and high-level metric insights, as well as a list of all registered clusters.

@minaelee (Contributor Author) replied:

> The Management API handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens. The UI also serves warnings and metric insights on a high level.

I added this, but "on a high level" could bear more explanation. Do you mean through optional extension with Grafana, or something more/else?

@edlerd (Contributor) replied Feb 5, 2026:

High level means aggregates of instances and MicroCloud cluster members, like the number of instances and their status distribution (how many are started/stopped, and so on). If the cluster manager is extended with the COS/Grafana stack, then Grafana indeed holds detailed information about each instance in every cluster.
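
A toy sketch of the kind of high-level aggregate described here; the data shape and field names are invented for illustration, since the real schema is not shown in this PR:

```python
# Toy example: per-status and per-cluster aggregates like those the UI shows.
# The instance records and field names are invented for illustration.
from collections import Counter

instances = [
    {"cluster": "microcloud-a", "name": "vm1", "status": "Running"},
    {"cluster": "microcloud-a", "name": "vm2", "status": "Stopped"},
    {"cluster": "microcloud-b", "name": "vm3", "status": "Running"},
]

total = len(instances)
status_distribution = Counter(i["status"] for i in instances)
per_cluster = Counter(i["cluster"] for i in instances)

print(f"{total} instances across {len(per_cluster)} clusters")
print("Status distribution:", dict(status_distribution))
# Detailed per-instance metrics would live in Grafana only if the optional
# COS stack is deployed alongside the cluster manager.
```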

- mTLS authentication check against the matched certificate
- Store and overwrite data in the `remote_cluster_details` table

To avoid overwhelming the Cluster Connector deployment, the status endpoints are rate limited. The response sent to the originating cluster includes a delay period (in seconds) that must pass before the next status signal request.
@edlerd (Contributor) commented:

All endpoints are rate limited, not just this one.

@minaelee (Contributor Author) replied Feb 4, 2026:

Updated to:

> (ref-cluster-manager-architecture-rate-limited)=
> ## Rate limited endpoints
>
> To avoid overwhelming the Cluster Manager, all its endpoints are rate limited. When any endpoint receives a request from a cluster, the response from Cluster Manager includes a delay period (in seconds) that must pass before the next request to that endpoint.

Or did you mean all endpoints for the Cluster Connector deployment only?

Also: do you want to change the term "Cluster Connector deployment" to "Cluster Connector" (like with "Management API deployment" to "Management API") or does it make sense to keep the word "deployment" here?

@edlerd (Contributor) replied:

I think the suggestion is slightly confusing. We have rate limiting in place to avoid overwhelming the cluster manager, yes.

The functionality to signal to the MicroCloud in a response when it should call in again is unrelated to the rate limiting, though.
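
To illustrate the distinction, a hypothetical client-side sketch: the MicroCloud honors a server-suggested delay between status calls, while server-side rate limiting remains a separate safeguard. The endpoint URL and the `delay` field name are assumptions, not the real API:

```python
# Illustrative sketch only: a reporting loop on the MicroCloud side that
# waits for a server-suggested delay between calls. The URL and the "delay"
# field are placeholders; the real client would also present its client
# certificate for mTLS, which is omitted here for brevity.
import json
import time
import urllib.request

STATUS_URL = "https://cluster-manager.example.com/1.0/status"  # placeholder

def send_status(payload: dict) -> dict:
    req = urllib.request.Request(
        STATUS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Server-side rate limiting (e.g. an HTTP 429) is a separate concern
    # from the delay hint returned in a successful response.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

while True:
    response = send_status({"status": "ok"})
    # The server tells the cluster when to call in again; default to 60s.
    time.sleep(response.get("delay", 60))
```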

Signed-off-by: Minae Lee <minae.lee@canonical.com>
@minaelee force-pushed the cluster-manager-architecture branch from 2496821 to a2c04aa on February 4, 2026 at 23:36

Labels

Documentation (Documentation needs updating)

2 participants